Closing the Vocabulary Gap for Computing Text Similarity and Information Retrieval

Authors

  • Christof Müller
  • Iryna Gurevych
  • Max Mühlhäuser
Abstract

This paper studies the integration of lexical semantic knowledge in two related semantic computing tasks: ad-hoc information retrieval and computing text similarity. For this purpose, we compare the performance of two algorithms: (i) using semantic relatedness, and (ii) using a conventional extended Boolean model [13] with additional query expansion. For the evaluation, we use two different test collections in the German language especially suitable to study the vocabulary gap problem: (i) GIRT [5] for the information retrieval task, and (ii) a collection of descriptions of professions built to evaluate a system for electronic career guidance in the information retrieval and text similarity tasks. We found that integrating lexical semantic knowledge increases the performance for both tasks. On the GIRT corpus, the performance is improved only for short queries. The performance on the collection of professional descriptions is improved, but crucially depends on the accurate preprocessing of the natural language essays employed as topics.

Similar resources

Closing the Service Discovery Gap by Collaborative Tagging and Clustering Techniques

Whereas the number of services that are provided online is growing rapidly, current service discovery approaches seem to have problems fulfilling their objectives. Existing approaches are hampered by the complexity of underlying semantic service models and by the fact that they try to impose a technical vocabulary to users. This leads to what we call the service discovery gap. In this paper we ...

Information Retrieval based on Paraphrase

Text Retrieval systems based on ranking use similarity as an approximation to relevance. Most of these systems ignore word meaning. We assume that some measure of paraphrase would be a better similarity measure. We develop a concept of paraphrase based on Meaning-Text Theory and implement an approximation to the ideal using the Longman Dictionary of Contemporary English (LDOCE). The performance...

Investigating the Role of Types of Homograph Contexts in Determining Similarity between Documents

Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation and thereby improving the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided to different types, wit...

Remedies against the Vocabulary Gap in Information Retrieval

Search engines rely heavily on term-based approaches that represent queries and documents as bags of words. Text---a document or a query---is represented by a bag of its words that ignores grammar and word order, but retains word frequency counts. When presented with a search query, the engine then ranks documents according to their relevance scores by computing, among other things, the matchin...

Studying the Effect of Retrieval Direction during Reading on Productive and Receptive Knowledge of Vocabulary

Retrieval tasks provide learners with an opportunity to focus both on meaning and on form. There are four different retrieval directions. The present study aimed to identify the optimal direction of recall type retrievals during reading and to investigate the outcomes of each one. Forty-eight intermediate EFL learners took part in the study. One of the experimental groups was provided with the ...

Journal title:
  • Int. J. Semantic Computing

Volume 2

Publication date: 2008